A Solvable Sequence Evolution Model and Genomic Correlations 
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We study a minimal model for genome evolution whose elementary processes are single site muta- 
tion, duplication and deletion of sequence regions and insertion of random segments. These processes 
are found to generate long-range correlations in the composition of letters as long as the sequence 
length is growing, i.e., the combined rates of duplications and insertions are higher than the deletion 
rate. For constant sequence length, on the other hand, all initial correlations decay exponentially. 
These results are obtained analytically and by simulations. They are compared with the long-range 
correlations observed in genomic DNA, and the implications for genome evolution are discussed. 
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Over a decade ago, long-range correlations in the se- 
quence composition of DNA have been discovered P, 
I3, 1^. With the rapidly growing availability of whole- 
genome sequence data, the composition of genomic DNA 
can now be studied systematically over a wide range of 
scales and organisms. The statistical analysis is quite in- 
tricate since genomic DNA is a rather "patchy" statistical 
environment Q: it consists of genes, noncoding regions, 
repetitive elements etc., and all of these substructures 
have a systematic influence on the local sequence compo- 
sition. Variations in composition along the genome have 
been studied extensively by a number of different meth- 
ods H US USES [illl3> and it is now well estabhshed 
that long-range correlations in base composition appear 
in the genomes of many species. These can be mea- 
sured, for example, by the autocorrelation function C(r) 
of the GC-content, which measures the likelihood of find- 
ing G-C Watson-Crick pairs at a distance of r bases along 
the backbone of the DNA molecule. However, the form 
of these correlations is much more complex than simple 
power laws. Within one chromosome, there is often a va- 
riety of different scaling regimes and effective exponents, 
and sometimes no clear scaling at all. Moreover, the 
effective exponents of comparable scaling regimes vary 
considerably between different species, and even between 
different chromosomes of the same species 0, 0, . 

Despite the ubiquity of genomic correlations, little is 
known about their evolutionary origin. In this Letter, we 
address the question whether the observed correlations 
can be explained quantitatively by a biologically realis- 
tic "minimal" model of sequence evolution. We take into 
account four well known elementary evolutionary modes: 
single site mutations, duplications and deletions of exist- 
ing segments of the sequence, and insertions of random 
segments. The duplication processes are believed to be 
a crucial mechanism of genome growth 

Emilii; the 

length of the duplicated segments ranges from single let- 
ters to thousands of letters as in the case of gene dupli- 
cations. The model is minimal in the sense that all four 
elementary modes are local stochastic processes compat- 



ible with neutral evolution, i.e., they do not require any 
assumption of natural selection. An alternative possible 
reason for the observed correlations may be long-range 
interactions likely to be caused by natural selection for 
a specific local GC-content. An example of such a selec- 
tive process is the clustering of genes in some regions of a 
chromosome [l^ , but no plausible mechanism producing 
long-range interactions has been proposed so far. 

Li's original work has shown that already a simple 
stochastic process consisting of duplications and muta- 
tions of single letters leads to generic power law correla- 
tions in the sequence composition Here we analyze 
in detail the generalized sequence evolution model intro- 
duced above. In particular, we calculate the stationary 
two-point correlation function C(r). It is of power law 
form, C(r) ~ r^", with a decay exponent a depend- 
ing on only two effective parameters, which are simple 
functions of the rates of the elementary processes. These 
long-range correlations are generic as long as the rates 
of the processes result in a growing sequence. At con- 
stant sequence length, however, the stationary correla- 
tions in sequence composition vanish, and initial corre- 
lations from a previous growth phase decay. Our ana- 
lytic results (which differ from Li's approximate expres- 
sions 01 and the results of [31 ) are in excellent agree- 
ment with our numerical simulations. We use these re- 
sults to infer from measured values of a a lower bound 
on the growth rate of the genome, which can be com- 
pared with independent estimates. The implications of 
our findings on the evolution of mammalian genomes are 
discussed at the end of this Letter. 

Sequence evolution model. — The stochastic evolution 
model generates sequences (si, . . . , sn) of variable length 
N. For simplicity, their letters are taken from a binary al- 
phabet; Sk — ±1. (In the application to genomic systems, 
Sk = +^ denotes a GC-pair and Sfc = — 1 an AT-pair at 
backbone position k.) The elementary evolutionary steps 
are mutations, duplications, insertions, and deletions of 
single letters (the generalization to segments will be dis- 
cussed below). They are Markov processes with rates /x. 
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rate fj, 
rate S 
rate 7''' 
rate 7", 



(1) 



where a; = ±1 denotes an uniformly distributed random 
letter. Duplication and insertion events introduce a new 
letter next to an exiting one and shift all subsequent let- 
ters one position to the right, thereby increasing the se- 
quence length by 1. Conversely, deletions shorten the 
length by 1. This type of Markov evolution model is 
widely used in computational biology, forming the statis- 
tical basis of sequence alignment algorithms Run- 
ning all four processes over a time t produces a statistical 
ensemble of sequences; the corresponding averages are 
denoted by {. . .){t). This ensemble is characterized by 
the rates S, /i, 7+, 7", and by the initial sequence. Here 
we use sequences of length 1 with a fixed letter, (si) = 1, 
or a random letter, (si) = x. 

After a time t, the sequences have an average length 
{N){t) — exp(At) with the effective growth rate 



A = (5 + 7^ 
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(2) 



We are interested in two dynamical regimes, sequence 
growth from a single- letter initial state (i.e., A > 0) and 
the evolution of sequences at stationary length (TV) ^ 1 
(i.e., A = 0), to which we now turn in order. 

Growth dynamics and stationary correlations. — The 
composition bias of the sequences at position k is mea- 
sured by the expectation value (sfe)(t). It is easy to show 
that any initial composition bias decays due to mutations 
and random insertions. We note that each insertion can 
be regarded as a duplication with a subsequent mutation 
in half of the cases, resulting in an effective mutation rate 



Mcff = M + 7^/2- 



(3) 



We obtain (sfe)(t) cx exp(— 2/icff^) for fixed initial con- 
dition, while (sfe)(t) = for random initial conditions. 
The composition correlation C(r) = {skSk+r){t) between 
two sequence positions at distance r is affected by all 
four processes and is independent of the initial condition. 
Its evolution equation can be derived by writing it as 
C(r, t) = Peq(r, t) -Pop(r, t), where Pcq{r, t) and Pop(r, t) 
denote the joint probabilities of finding two symbols of 
equal and opposite signs, respectively, at a distance r. 
The Master equation for Poq(r, t) takes the form 



d 

dt ' 



2Mcff hPcq(r,t)-f Pop(r,t)] 

+ [r5 + (r - 1)7+] [Peq(r - - Pcq(r,i)] 
+r7- [Peq(r + l,i)-Peq(r,t)]. (4) 



The first term on the r.h.s. describes the change in 
Peq(r, due to mutations and random insertions, while 



the second term specifies the probability current due to 
duplication of a site in the interval {k^k -\- r — 1) or in- 
sertion of a new site in the interval (fc, fc + r — 2). The 
third term gives the corresponding current due to dele- 
tions. By exchanging Peq and Pop, we obtain a similar 
equation for Pop(r,t). Hence we have 



d_ 
di 



C{r,t) = -VcffC(r) 

+ [r5 + {r- 1)7+] [C(r - 1) - C{r)] 
+r-i- [C{r + l)-C{r)]. (5) 



For the special case with only single-letter duplications 
and mutations [5, /x > 0, 7"*" 
lent to Li's original model 

form for the stationary C(r) by solving the recursion 



7_^= 7 =0), which is equi va- 
il [131 , we find a simple analytical 



C{r) = 



C{r - 1) 



with a = — 




(6) 



and the initial value C(0) = 1. This gives 
r(r-h l)r(l-Ka) _ 



C{r) 



T{r + 1 



1 



P(r,a), (7) 



where T{x) is the gamma function and B{x,y) the beta 
function. Evaluating its asymptotic behavior for x ^ 1, 



B{x,y) oc T{y) x ^ 
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then produces the algebraic decay C(r) oc r~". For 
the general case including insertions and deletions, the 
asymptotic decay can still be obtained exactly in the 
continuum limit. For r 3> 1 and (5 > 0, the difference 
equation ([SJl becomes the differential equation 

-C(r, t) = -4MeffC(r, t) - r\—C{r, t) (8) 
ot or 

with the effective rates /icfr and A defined by Q and 
This has the stationary solution 



C(r) oc r 



with 



4/Xcff 
A ■ 



(9) 



Eq. © clearly shows the mechanism generating long- 
range correlations in this type of sequence evolution 
model. Correlations are continuously produced at small 
scales by duplications and transported to larger distances 
by the net exponential expansion of the sequence (re- 
sulting from duplications and insertions/deletions). On 
the other hand, correlations decay exponentially due to 
processes randomizing the sequence (i.e., mutations and 
random insertions). The competition between expan- 
sion and randomization produces the algebraic decay 
C{r) oc r~", which is highly universal. Microscopic de- 
tails of the evolution processes are irrelevant, the expo- 
nent a is determined by a simple balance between the 
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growth rate A and the effective mutation rate iieS- Hence, 
an extended model containing duphcations, deletions and 
random insertions of sequence segments of finite length 
£ — 1, 2, fniax with respective rates 5;, 7^ 
still has the same asymptotics ® for N{t) 3> 
r » ^max- The effective rates © are now given by 



and 7^ 
max and 



A 



e + lT -le \ ^ Moff = M + 



^E^^/- (10) 



This asymptotics can again be proved from an exact Mas- 
ter equation similar to Q [ij]. The extended model is 
important for genomic evolution since strong long-range 
correlations (i.e., small values of a) can be the combined 
result of segment duplications with different values of £. 
Their individual rates Si might be small and difficult to 
assess but the cumulative rate A can still be estimated. 

Stationary-length dynamics and time- dependent corre- 
lations. — It is obvious from Eq. (|SJ| that stationary long- 
range correlations only exist as long as the sequence 
grows, i.e. for A > 0. Consider now the following evolu- 
tionary scenario: sequence growth with rate Ai > up 
to a length No — A^(to), followed by a second phase with 
A2 = and {N){t) = Nq for t> U). The time-dependent 
solution of Eq. © for the asymptotics of C(r, t) is then 

C(r,t) = C(r,io) e-'^^""'^* oc r-4Mcff/Aig-4M„«At ^^^^ 

with At = i — to > 0. In the second phase, the long-range 
tails of C{r,t) are preserved but their amplitude decays 
with a characteristic time scale r = 

Numerical results. — We have performed extensive 
Monte Carlo simulations of our model. During each time 
step At = [(^ + Yl,A^i + 7/ + we choose a 

random site and apply one of the elementary processes 
with its relative weight. For a single realization of this 
dynamics, the correlation function C{r) is well approxi- 
mated by the sequence average (A^ — r)^^ X^fcLi'" SkSk+r- 
Further averaging over 100 realizations produces very ac- 
curate measurements of C(r). 

Fig.^a) shows the numerical C(r) for the single- letter 
duplication-mutation dynamics with various rates, which 
is in excellent agreement with the analytic expression |2J . 
The same is shown in Fig. ^b) for the general case with 
all types of processes present, verifying the asymptotic 
behavior ^ and H10|) . For completeness, we have also 
obtained power spectra and the mutual information func- 
tion, as defined in which have the expected decay 
exponents \ — a and 2a, respectively. 

The dynamical build-up of these correlations for grow- 
ing sequences is seen in Fig.|5{a), which shows C(r, t) at 
various intermediate times of the growth process. The 
correlation rapidly converges to the stationary form for 
all distances r <, N{t). This should be compared with 
the time-dependence of C{r,t) at constant length in 
Fig. |2Ib), which shows an algebraic tail with an expo- 
nentially decreasing amplitude as predicted by Eq. 
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FIG. 1: Stationary C(r) at different rates of the elementary 
processes, (a) Single-letter duplication-mutation model: Nu- 
merical results (circles) and the analytical form lO (lines) for 
fi = 1, S varying, (b) Full model: Numerical results (circles) 
with the analytic asymptotics ^ and 1101 (lines) for = 1 
and varying rates of the other processes (rates not specified in 
the plot are zero). The dynamics of the sequences was simu- 
lated until they reached a length of = 2^"^ ~ 10*; C(r) was 
averaged over the sequence and over 100 runs. 



Genomic evolution. — As pointed out above, the pro- 
cesses discussed here build a minimal model for dynami- 
cally generated long-range correlations along a sequence. 
But can this model explain the observed correlations in 
genomic DNA? The correlation function C(r) along hu- 
man chromosomes shows a rather slow algebraic decay on 
distance scales 10^ < r < 10^ with typical effective ex- 
ponents a « 0.1 [lollll| . We have confirmed these mea- 
surements and found them to be consistent with sequence 
data from other mammals |0| . A lower bound of the ef- 
fective mutation rate in mammals is ^cff ~ 2 • 10^^a~^ 
per site Assuming stationary growth, we can use 

these values of a and HeS to derive a lower bound on 
the genomic growth rate A, resulting in a minimum value 
A « 10~^a~^ per site according to Eq. I^. However, this 
rate is much too high. Our genome would have expanded 
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FIG. 2: Time-dependent correlations C(r,i). (a) Build-up 
of long-range correlations by stationary growth. Measured 
C{r,t) at various intermediate lengths N{t) = 10^ 10*, 10*^ 
(symbols) together with the stationary form Q (line) for 
^ = 1, 5\ = 5 = 8, all other parameters are zero, (b) De- 
cay of correlations during sequence evolution at stationary 
length A'o = 10®. Measured C{r,t) at various times At (sym- 
bols) together with the analytic decay of the long-range tail 
given by Eq. Hll|l (lines) . Note that there are still correlations 
remaining on short length scales. 



much faster than it is observed since the current human 
genome contains N w 3-10^ base pairs and, assuming the 
above rate of genome expansion, would have contained 
only about 4 • 10^ base pairs at the time of mammalian 
radiation about 90 million years ago. This can clearly be 
rejected since approximately 40% of the human genome 
can be aligned to the mouse genome, representing most 
of the orthologous sequences that remain in both lineages 
from the common ancestor ■ 

Over longer evolutionary periods, genomic expansion 
phases with rates A ~ 10~''a~^ cannot be ruled out if we 
assume the history of the genome has been a punctuated 
process, with such expansion phases followed by periods 
of approximately constant length. In the human genome, 
there is by now ample evidence for growth by segmental 



duplications with various segment lengths [23, |2^. In 
a punctuated growth process, correlations are produced 
and transported during the expansion phases. During 
the stationary phases, the previously established corre- 
lations decay as given by Eq. Ullfl. In mammals, the 
last period of rapid expansion has been the mammalian 
radiation, and the characteristic time scale of the decay 
is r « 100 Myr. Correlations present or generated at 
the time of the mammalian radiation would hence still 
persist. The succession of several distinct growth phases 
with different values of A and ficS could even explain cor- 
relations C(r) with several scaling regimes as found in 
human chromosomes We conclude that the correla- 
tions observed in mammals are compatible with a punc- 
tuated expansion-randomization process. Of course, this 
does not rule out other causes. Indeed, the rather diverse 
functional forms found in different species may point to- 
wards more than one generating mechanism. If genomic 
expansion proves to be a significant contribution, compo- 
sition correlations could be the "background radiation" 
of genomics, allowing us to trace the history of genomes 
far back in evolutionary time. 
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